INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE 



Task-Driven Dictionary Learning 

Julien Mairal — Francis Bach — Jean Ponce 



N° 7400 

September 2010 
Vision, Perception and Multimedia Understanding 



^ apport 

C de recherche 



INSTITUT NATIONAL 
DE RECHERCHE 
EN INFORMATIQUE 
ET EN AUTOMATIQUE 




IN FLIA 



centre de recherche 

PARIS - ROCQUENCOURT 



Task-Driven Dictionary Learning 

Julien Mairal Q , Francis Bach* , Jean PoncdH 

Theme : Vision, Perception and Multimedia Understanding 
Perception, Cognition, Interaction 
Equipe-Projet Willow 

Rapport de recherche n° 7400 — September 2010 — |2"T1 pages 



Abstract: Modeling data with linear combinations of a few elements from a learned dictionary has 
been the focus of much recent research in machine learning, neuroscience and signal processing. For 
signals such as natural images that admit such sparse representations, it is now well established that 
these models are well suited to restoration tasks. In this context, learning the dictionary amounts 
to solving a large-scale matrix factorization problem, which can be done efficiently with classical 
optimization tools. The same approach has also been used for learning features from data for other 
purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks 
has proven to be more difficult. In this paper, we present a general formulation for supervised 
dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving 
the corresponding optimization problem. Experiments on handwritten digit classification, digital 
art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our 
approach is effective in large-scale settings, and is well suited to supervised and semi-supervised 
classification, as well as regression tasks for data that admit sparse representations. 
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Resume : Le codage parcimonieux consiste a representer des signaux comme combinaisons lineaires 
de quelques elements d'un dictionnaire. Cette approche a fait l'objet d'un nombre important de 
travaux en apprentissage statistique, traitement du signal et neuro-sciences. Pour des signaux qui 
admettent des representations parcimonieuses, il est maintenant admis que cette approche permet 
d'obtenir de tres bons resultats en restauration. Dans ce contexte, apprendre le dictionnaire resulte 
en un probleme non convexe de factorisation de matrice, qui peut etre traite efHcacement par des 
outils d'optimisation classique. Cette meme approche a aussi ete utilisee pour des taches autres que 
la reconstruction, comme la classification d'image, mais apprendre le dictionnaire de fagon supervise 
est plus difficile. Nous presentons dans cet article une methode d'apprentissage de dictionnaire 
supervisee, fondee sur un algorithme d'optimisation stochastique, pour des taches de classification 
ou de regression. Les experiences menees en reconnaissance de chiffres, problemes inverses non- 
lineaires dans les images, et codage compresse, montrent que notre approache est efficace a large 
echelle, et permet de resoudre des taches de classification et regression pour des donnees admettant 
des representations parcimonieuses. 

Mots-cles : Basis pursuit, Lasso, apprentissage de dictionnaire, factorisation de matrice, analyse 
en composantes principales parcimonieuses, apprentissage semi-supervise, compressed sensing 
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1 Introduction 

The linear decomposition of data using a few elements from a learned dictionary instead of a prede- 
fined one — based on wavelets pQ for example — has recently led to state-of-the-art results in numerous 
low- level signal processing tasks such as image denoising 0O2I4], audio processing [5l[6]> as well as 
classification tasks [HI [HI EH EH [H| ■ Unlike decompositions based on principal component anal- 
ysis (PCA) and its variants, these sparse models do not impose that the dictionary elements be 
orthogonal, allowing more flexibility to adapt the representation to the data. 

Concretely, consider a vector x in R m . We say that it admits a sparse approximation over a 
dictionary D = [di, . . . ,d p ] in R mxp , when one can find a linear combination of a "few" columns 
from D that is "close" to the vector x. Experiments have shown that modeling signals with such 
sparse decompositions (sparse coding) is very effective in many signal processing applications |13| . 
For natural images, predefined dictionaries based on various types of wavelets |1] have been used for 
this task. Initially introduced by Olshausen and Field [TJ for modeling the spatial receptive fields of 
simple cells in the mammalian visual cortex, the idea of learning the dictionary from data instead of 
using off-the-shelf bases has been shown to significantly improve signal reconstruction [2] . Although 
some of the learned dictionary elements may sometimes "look like" wavelets (or Gabor filters) , they 
are tuned to the input images or signals, leading to much better results in practice. 

This classical data-driven approach to dictionary learning is well adapted to reconstruction tasks, 
such as restoring a noisy signal. These dictionaries, which are good at reconstructing clean signals, 
but bad at reconstructing noise, have indeed led to state-of-the-art denoising algorithms [3l 0j. 
Unsupervised dictionary learning has also been used for other purposes than pure signal reconstruc- 
tion, such as classification [Sj [TTJ [TOJ Q~5] , but recent works have shown that better results can 
be obtained when the dictionary is tuned to the specific task (and not just data) it is intended for. 
Duarte-Carvajalino and Sapiro |I6| have for instance proposed to learn dictionaries for compressed 
sensing, and in [HI El [TO] dictionaries are learned for signal classification. In this paper, we will refer 
to this general approach as task-driven dictionary learning. 

Whereas purely data-driven dictionary learning has been shown to be equivalent to a large-scale 
matrix factorization problem that can be effectively addressed with several methods [131 [13 [TOJ [TO] , 
its task-driven counterpart has proven to be much more difficult to optimize. Presenting a general 
efficient framework for various task-driven dictionary learning problems is the main topic of this 
paper. Even though it is different from existing machine learning approaches, it shares similarities 
with many of them. 

For instance, Blei et al. [20J have proposed to learn a latent topic model intended for document 
classification. In a different context, Argyriou et al. |21) introduced a convex formulation for multi- 
task classification problems where an orthogonal linear transform of input features is jointly learned 
with a classifier. Learning compact features has also been addressed in the literature of neural 
networks, with restricted Boltzmann machines (RBM's) and convolutional neural networks for ex- 
ample (see |22[|2"3"I [24, 25, 26J and references therein). Interestingly, the question of learning the data 
representation in an unsupervised or supervised way has also been investigated for these approaches. 
For instance, a supervised topic model is proposed in |27) and tuning latent data representations for 
minimizing a cost function is often achieved with backpropagation in neural networks [28J . 

1.1 Contributions 

This paper makes three main contributions: 

• It introduces a supervised formulation for learning dictionaries adapted to various tasks instead 
of dictionaries only adapted to data reconstruction. 
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• It shows that the resulting optimization problem is smooth under mild assumptions, and em- 
pirically that stochastic gradient descent addresses it efficiently. 

• It shows that the proposed formulation is well adapted to semi-supervised learning, can exploit 
unlabeled data when they admit sparse representations, and leads to state-of-the-art results 
for various machine learning and signal processing problems. 

1.2 Notation 

Vectors are denoted by bold lower case letters and matrices by upper case ones. We define for 
q > 1, the £ q -noTm of a vector x in R m as ||x|| 9 = (Y^Li | X HI 9 ) 1/ ' 9 ) where x[i] denotes the i-th 
entry of x, and Hx^ = maXt=i ( ... )m |x[i]| = lim^oo ||x||q. We also define the £ -pseudo-norm 
as the number of nonzero elements in a vector. We consider the Frobenius norm of a matrix X 
in R mx ™: ||X||f = E"=i EjLi -^[*> i] 2 ) 1 ^ 2 - We also write for a sequence of vectors x t and scalars ut, 
x t = 0(ut), when there exists a constant K > independent of t so that for all t, ||x t ||2 < Ku t , 
and use a similar notation for matrices (note that for finite-dimensional vector spaces, the choice of 
norm is irrelevant). When A C {1, . . . , m} is a finite set of indices, xa denotes the vector in R' A I 
containing the values of x corresponding to the indices in A. Similarly, when X is a matrix in M. mxn 
and A C {1, . . . , n}, Xa is the matrix in R mx l A l containing the columns of X corresponding to the 
indices in A. 

The rest of this paper is organized as follows: Section [5] briefly recalls the classical data-driven 
dictionary learning framework. Section[3]is devoted to our new task-driven framework, and Section|4] 
to efficient algorithms to addressing the corresponding optimization problems. Section [5] presents 
several dictionary learning experiments for signal classification, signal regression, and compressed 
sensing. 

2 Data-Driven Dictionary Learning 

Classical dictionary learning techniques for sparse coding [T21 US [T5] consider a finite training set of 
signals X = [xi, . . . , x n ] in M. mxn and minimize the empirical cost function 



with respect to a dictionary D in R mxp , each column representing a dictionary element. Here l u is 
a loss function such that £„(x, D) should be small if D is "good" at representing the signal x in a 
sparse fashion. As emphasized by the index u of £ u , this optimization problem is unsupervised. Note 
that, in this setting, overcomplete dictionaries with p > m are allowed. As others (see, e.g., |18|). 
we define £ u (x, D) as the optimal value of a sparse coding problem. In this work, we choose the 
elastic-net formulation of [29 : 



where Ai and A2 are regularization parameters. When A2 = 0, this leads to the i\ sparse decom- 
position problem, also known as basis pursuit [T3j, or Lasso [3UI- Here, our choice of the elastic- net 
formulation over the Lasso is mainly for stability reasons. Using a parameter A2 > makes the 
problem of Eq. ([1]) strongly convex and, as shown later in this paper, ensures its unique solution to 
be Lipschitz with respect to x and D with a constant depending on A2. Whereas the stability of 



5n (D)4-^4( Xi ,D), 



4(x,D) = 




n il n Jll x - Da ll2 + A ilMli + ^IM 



(i) 
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this solution is not necessarily an issue when learning a dictionary for a reconstruction task, it has 
turned out to be empirically important in some of our experiments with other tasks. 

To prevent the ^2-norm of D from being arbitrarily large, which would lead to arbitrarily small 
values of a, it is common to constrain its columns dj., . . . , d p to have ^2-norms less than or equal to 
one. We will call T) the convex set of matrices satisfying this constraint: 

V±{DeR mxp s.t. Vj€{l,...,p}, H-|| 2 <1}. (2) 

As pointed out by Bottou and Bousquet |31| . one is usually not interested in a perfect minimization 
of the empirical cost g n (D), but instead in the minimization with respect to D of the expected cost 

9 (D)=E x [4(x,D)]= lim g n (D) a.s., (3) 

n— > oo 

where the expectation is taken relative to the (unknown) probability distribution p(x) of the data, 
and is supposed to be finite0 In practice, dictionary learning problems often involve a large amount 
of data. For instance when the vectors x represent image patches, n can be up to several millions in a 
single image. In this context, online learning techniques have shown to be very efficient for obtaining 
a stationary point of this optimization problem [19] ■ Our paper exploits this stochastic setting as 
well, by proposing to minimize an expected cost corresponding to a supervised dictionary learning 
formulation, which we now present. 



3 Proposed Formulation 

We introduce in this section a general framework for learning dictionaries adapted to specific super- 
vised tasks, e.g., classification, as opposed to the unsupervised formulation of the previous section, 
and present different extensions along with possible applications. 

3.1 Basic Formulation 

Obtaining a good performance in classification tasks is often related to the problem of finding a good 
data representation. Sparse decompositions obtained with data-driven learned dictionaries have been 
used for that purpose in [5] and [7], showing promising results for audio data and natural images. 
We present in this section a formulation for learning a dictionary in a supervised way for prediction 
tasks (regression or classification). 

Concretely, given a learned dictionary D, a vector x in X C M. m can be represented as a*(x, D), 
where 

a*(x,D)=argmini||x-Da||i + A 1 ||a|| 1 + ^||a||l, (4) 

is the elastic-net solution [29] . 

Now suppose that we are interested in predicting a variable y in y from an observation x, where y 
is either a finite set of labels for classification tasks, or a subset ofW for some integer q for regression 
tasks. Suppose first that D is fixed and obtained with the unsupervised formulation of Eq. §5§ . Using 
a*(x, D) as features for predicting the variable y, we can learn model parameters W by solving: 

min/(W) + V -\\W\\l, (5) 

where W is a convex set, v is a regularization parameter, and / is defined as 

/(W)^E y , x [4(y,W,a*(x,D))], (6) 

1 We use "a.s." (almost sure) to denote convergence with probability one. 
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where £ s is a convex loss function, the index s of t s indicating that the loss is adapted to a su- 
pervised learning problem. The expectation is taken with respect to the unknown probability dis- 
tribution p(y, x) of the data. Depending on the setting (classification or regression), several loss 
functions can be used such as square, logistic, or hinge loss from support vector machines (see |32| 
and references therein), possibly with a nonlinear kernel as done in [TJ. Note that we do not specify 
yet the size of the matrix W since it will depend on the chosen application, as illustrated later in 
Section 13.31 

The formulation of Eq. (O exploits features a*(x, D), where D is given, possibly learned with the 
unsupervised formulation from Section [2] However, Mairal et al. [9], and Bradley and Bagnell [TO] 
have shown that better results can be achieved when the dictionary is obtained in a fully super- 
vised setting, tuned for the prediction task. We now introduce the task-driven dictionary learning 
formulation, that consists of jointly learning W and D by solving 

De ^ e w /(D ' w ) + i" w "- < 7 > 

where I? is a set of constraints defined in Eq. $2$, and / has the form 

/(D,W) 4E y , x [4(y,W,a*(x,D))]. (8) 

The main difficulty of this optimization problem comes from the non-differentiability of a* , which 
is the solution of a nonsmooth optimization problem Bradley and Bagnell [TU] have tackled 
this difficulty by introducing a smooth approximation of the sparse regularization which leads to 
smooth solutions, allowing the use of implicit differentiation to compute the gradient of the cost 
function they have considered. This approximation encourages some coefficients in a* to be small, 
and does not produce true zeros. It can be used when "true" sparsity is not required. In a different 
formulation, Mairal et al. [9] have used nonsmooth sparse regularization, but used heuristics to tackle 
the optimization problem. We show in Section 2] that better optimization tools than these heuristics 
can be used, while keeping a nonsmooth regularization for computing a* . 

Small variants of this formulation can also be considered: Non-negativity constraints may be 
added on a* and D, leading to a supervised version of nonnegative matrix factorization |33| . reg- 
ularized with a sparsity-inducing penalty. The function £ s could also take extra arguments such 
as D and x instead of just y, W, a*. For simplicity, we have omitted these possibilities, but the 
formulations and algorithms we present in this paper can be easily extended to these cases. 

Before presenting extensions and applications of the formulation we have introduced, let us first 
discuss the assumptions under which our analysis holds. 

3.1.1 Assumptions 

From now on, we suppose that: 

(A) The data (y,x) admits a probability density p with a compact support Ky x Kx C y x X. 
This is a reasonable assumption in audio, image, and video processing applications, where it 
is imposed by the data acquisition process, where values returned by sensors are bounded. To 
simplify the notations later in the paper, we suppose from now on that X and y are compact 

(B) When y is a subset of a finite-dimensional real vector space, p is continuous and l s is twice 
continuously differentiable. 

2 Even though images are acquired in practice after a quantization process, it is a common assumption in image 
processing to consider pixel values in a continuous space. 
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(C) When y is a finite set of labels, for all y in y, p(y, .) is continuous and £ s (y,.) is twice 
continuously differentiableU 

Assumptions (B) and (C) allow us to use several loss functions such as the square, logistic, or 
softmax losses. 

3.2 Extensions 

We now present two extensions of the previous formulations. The first one includes a linear transform 
of the input data, and the second one exploits unlabeled data in a semi-supervised setting. 

3.2.1 Learning a Linear Transform of the Input Data 

In this section, we add to our basic formulation a linear transform of the input features, represented 
by a matrix Z. Our motivation for this is twofold: It can be appealing to reduce the dimension of 
the feature space via such a linear transform, and/or it can make the model richer by increasing the 
numbers of free parameters. The resulting formulation is the following: 

De ^w. /(D ' w ' z) + T l|w|ll+ y ||z|11 ' (9) 

where v\ and vi are two regularization parameters, Z is a convex set and 

/(D,W,Z)4E y , x [4(y,W,a*(Zx,D))]. (10) 

It is worth noticing that the formulations of Eq. ([7]) and Eq. ^ can also be extended to the case of 
a cost function depending on several dictionaries involving several sparse coding problems, such as 
the one used in [8] for signal classification. Such a formulation is not developed here for simplicity 
reasons, but algorithms to address it can be easily derived from this paper. 

3.2.2 Semi-supervised Learning 

As shown in [7], sparse coding techniques can be effective for learning good features from unlabeled 
data. The extension of our task-driven formulation to the semi-supervised learning setting is natural 
and takes the form 

De min ew (l - /x)E y , x [4(y, W,a*(x,D))] + M E x [4(x,D)] + |||W||f, (11) 

where the second expectation is taken with respect to the marginal distribution of x. The function t u 
is the loss function defined in Eq. ([T]), and fi in [0, 1] is a new parameter for controlling the trade-off 
between the unsupervised learning cost function and the supervised learning one. 

3.3 Applications 

For illustration purposes, we present a few applications of our task-driven dictionary learning formu- 
lations. Our approach is of course not limited to these examples. 

3 For a given value of y and function g, g(y, .) denotes the function which associates to a vector x the value g(y, x). 
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3.3.1 Regression 

In this setting, y is a subset of a g-dimensional real vector space, and the task is to predict variables y 
in y from the observation of vectors x in X . A typical application is for instance the restoration 
of clean signals y from observed corrupted signals x. Classical signal restoration techniques often 
focus on removing additive noise or solving inverse linear problems |34) . When the corruption results 
from an unknown nonlinear transformation, we formulate the restoration task as a general regression 
problem. This is the case for example in the experiment presented in Section [5.31 
We define the task-driven dictionary learning formulation for regression as follows: 



min E v x 
wew.Dev " 



i||y-Wa*(x,D)||2 +-||W||f. (12) 



At test time, when a new signal x is observed, the estimate of the corresponding variable y provided 
by this model is Wa*(x, D) (plus possibly an intercept which we have omitted here for simplicity 
reasons). Note that we here propose to use the square loss for estimating the difference between y 
and its estimate Wa*(x, D), but any other twice differentiable loss can be used. 



3.3.2 Binary Classification 

In this section and in the next one, we propose to learn dictionaries adapted to classification tasks. 
Our approach follows the formulation presented in [3J, but is slightly different and falls into our 
task-driven dictionary learning framework. In this setting, the set y is equal to { — 1; +1}- Given a 
vector x, we want to learn the parameters w in W of a linear model to predict y in y, using the 
sparse representation ct*(x, D) as features, and jointly optimize D and w. For instance, using the 
logistic regression loss, our formulation becomes 

(13) 

Once D and w have been learned, a new signal x is classified according to the sign of w t q;*(x, D). 
For simplicity reasons, we have omitted the intercept in the linear model, but it can easily be included 
in the formulation. Note that instead of the logistic regression loss, any other twice differentiable 
loss can be used as well. 

As suggested in [3], it is possible to extend this approach with a bilinear model by learning a 
matrix W so that a new vector x is classified according to the sign of x T Wa*(x, D). In this setting, 
our formulation becomes 

+ ^I|W|||. (14) 

This bilinear model requires learning pm parameters as opposed to the p parameters of the linear 
one. It is therefore richer and can sometimes offer a better classification performance when the linear 
model is not rich enough to explain the data, but it might be more subject to overfitting. 

Note that we have naturally presented the binary classification task using the logistic regression 
loss, but as we have experimentally observed, the square loss is also an appropriate choice in many 
situations. 



mm 1 

wflP.DPD 



logfl 



-yw ct*(x.D)^ 



mm 

W<=R m xp.r>(=7-> 



E 



;/.x 



log(l + e^ xTwa *( x < D )) 



3.3.3 Multi-class Classification 

When y is a finite set of labels in {l,...,q} with q > 2, extending the previous formulation to 
the multi-class setting can be done in several ways, which we briefly describe here. The simplest 
possibility is to use a set of binary classifiers presented in Section 13.3.21 in a "one-vs-all" or "one-vs- 
one" scheme. Another possibility is to use a multi-class cost function such as the soft-max function, 
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to find linear predictors w^, k in {1, . . . , q} such that for a vector x in X, the quantities wja:*(x, D) 
are encouraged to be greater than w A To:*(x, D) for all k ^ y. Another possibility is to turn the multi- 
class classification problem into a regression one and consider that y is a set of q binary vectors of 
dimension q such that the k— th vector has 1 on its fc-th coordinate, and elsewhere. This allows 
using the regression formulation of Section 13.3.11 to solve the classification problem. 



3.3.4 Compressed sensing 

Let us consider a signal x in R m , the theory of compressed sensing [35 , 36J tells us that under certain 
assumptions, the vector x can be recovered exactly from a few measurements Zx, where Z in K rXTn 
is called a "sensing" matrix with r <C m. Unlike classical signal processing methods, such a linear 
transformation is sometimes included physically in the data acquisition process itself |37) . meaning 
that a sensor can provide measurements Zx without directly measuring x at any moment. 

In a nutshell, the recovery of x has been proven to be possible when x admits a sparse repre- 
sentation on a dictionary D, and the sensing matrix Z is incoherent with D, meaning that the rows 
of Z are sufficiently uncorrelated with the columns of D (see [351 136| for more details)! To ensure 
that this condition is satisfied, Z is often chosen as a random matrix, which is incoherent with any 
dictionary with high probability. 

The choice of a random matrix is appealing for many reasons. In addition to the fact that it 
provides theoretical guarantees of incoherence, it is well suited to the case where m is large, making 
it impossible to store a deterministic matrix Z into memory, whereas it is sufficient to store the 
seed of a random process to generate a random matrix. On the other hand, large signals can often 
be cut into smaller parts that still admit sparse decompositions, e.g., image patches, which can be 
treated independently with a deterministic smaller matrix Z. When this is the case or when m has 
a reasonable size, the question of whether to use a deterministic matrix Z or a random one arises, 
and it has been empirically observed that learned matrices Z can outperform random projections: 
For example, it is shown in [38J that classical dimensionality reduction techniques such as principal 
component analysis (PCA) or independent component analysis (ICA) could do better than random 
projections in noisy settings, and in |16j that jointly learning sensing matrices and dictionaries can 
do even better in certain cases. A Bayesian framework for learning sensing matrices in compressed 
sensing applications is also proposed in |39| . 

Following the latter authors, we study the case where Z is not random but learned at the same 
time as the dictionary, and introduce a formulation which falls into out task-driven dictionary learning 
framework: 



mm 

wei mX 
zm rx " 



-||y-Wa*(Zx,D)||| + ^||W||| + nr.. 



where we learn D, W and Z so that the variable y should be well reconstructed when encoding the 
"sensed" signal Zx with a dictionary D. In a noiseless setting, y is naturally set to the same value 
as x. In a noisy setting, it can be a corrupted version of x. 

After having presented our general task-driven dictionary learning formulation, we present next 
a strategy to address the corresponding nonconvex optimization problem. 



4 The assumption of "incoherence" between D and Z can be replaced with a different but related hypothesis called 
restricted isometry property. Again the reader should refer to |35J 136] for more details. 
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4 Optimization 

We first show that the cost function / of our basic formulation ((?]) is differentiable and compute its 
gradient. Then, we refine the analysis for the different variations presented in the previous section, 
and describe an efficient online learning algorithm to address them. 



4.1 Differentiability of / 

We analyze the differentiability of / as defined in Eq. (JTJ) with respect to its two arguments D and W. 
We consider here the case where y is a compact subset of a finite dimensional real vector space, but 
all proofs and formulas are similar when y is a finite set of labels. The purpose of this section 
is to show that even though the sparse coefficients a* are obtained by solving a non-differentiable 
optimization problem, / is differentiable on W x D, and one can compute its gradient. 

The main argument in the proof of Propositions Q] and [5] below is that, although the func- 
tion a*(x, D) is not differentiable, it is uniformly Lipschitz continuous, and differentiable almost 
everywhere. The only points where a* is not differentiable are points where the set of nonzero coeffi- 
cients of a* change (we always denote this set by A in this paper). Considering optimality conditions 
of the elastic-net formulation of Eq. (HJ), these points are easy to characterize. The details of the 
proof have been relegated to the Appendix (Lemma [T] and Proposition [3]) for readability purposes. 
With these results in hand, we then show that / admits a first-order Taylor expansion meaning that 
it is differentiable, the sets where a* is not differentiable being negligible in the expectation from 
the definition of / in Eq. @. We can now state our main result: 

Proposition 1 (Differentiability and gradients of /) 

Assume A2 > 0, (A.), (B) and (C). Then, the function f defined in Eq. (pj) is differentiable, and 
f V w /(D,W)=E y , x [Vw4(y,W,a*)], 

\ V D /(D,W)=E y , x [-D/3W T + (x-Da*)/3* T ], 1 
where a* is short for a*(x, D), and (3* is a vector in W that depends on y, x, W, D with 

f3* A c = and (3% = (DjD A + X 2 iy 1 W a J s (y, W, a*), (17) 
where A denotes the indices of the nonzero coefficients o/a*(x, D). 

The proof of this proposition is given in Appendix. We have shown that the function defined in Eq. ([7]) 
is smooth, and computed its gradients. The same can be done for the more general formulation of 
Eq. (HT 



Proposition 2 (Differentiability, extended formulation) 

Assume A2 > 0, (A.), (B) and (C). Then, the function f defined in Eq. is differentiable. The 

gradients of f are 

'Vw/(D,W,Z)=E y>x [Vw4(y,W,a*)], 

< V D /(D,W,Z) =E y , x [-D/3V T + (Zx-Da*)/3 4T ], (18) 
k V Z /(D,W,Z) =E y , x [D/Tx T ], 

where a* is short for c**(Zx, D), and (3* is defined in Eq. H7\) . 

The proof is similar to the one of Proposition [1] in Appendix, and uses similar arguments. 
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4.2 Algorithm 

Stochastic gradient descent algorithms are typically designed to minimize functions whose gradients 
have the form of an expectation as in Eq. (|16[) . They have been shown to converge to stationary points 
of (possibly nonconvex) optimization problems under a few assumptions that are a bit stricter than 
the ones satisfied in this paper (see [21] and references therein) As noted in [T5]> these algorithms 
are generally well suited to unsupervised dictionary learning when their learning rate is well tuned. 

The method we propose here is a projected first-order stochastic gradient algorithm (see |40|). and 
it is given in Algorithm[T] It sequentially draws i.i.d samples (y t , x t ) from the probability distribution 
p(y, x). Obtaining such i.i.d. samples may be difficult since the density p(y, x) is unknown. At first 
approximation, the vectors (y t ,x t ) are obtained in practice by cycling over a randomly permuted 
training set, which is often done in similar machine learning settings |41j . 



Algorithm 1 Stochastic gradient descent algorithm for task-driven dictionary learning. 
Require: p(y, x) (a way to draw i.i.d samples of p), Ai, A2, v € K (regularization parameters), Del? 
(initial dictionary), W 6 W (initial parameters), T (number of iterations), t$,p (learning rate 
parameters) . 
l: for t = 1 to T do 
2: Draw (y t , x t ) from p(y, x). 

3: Sparse coding: compute a* using a modified LARS [42] . 

* -In t-*ii9 \ 11 1 1 ^2 11 1 1 9 

a <— argmm - \\x t — Dai-W^ + Ai||a||i + — 1 1 cat 1 1 2 - 

raeRp 2 ~ 2 

4: Compute the active set: 

A^{i6{l,... )P }:a*[i]^0}. 

5: Compute (3*: Set f3\ c = and 

ft = (DjD A + A2I)- 1 V a J a (y t , W, a*). 



6: Choose the learning rate p t ^— min (p, p^f\ . 

7: Update the parameters by a projected gradient step 



W <- n w 
d <— rix) 



W - p t (V w 4(y t , W, a*) + z/W 
D - p t ( - D/3*a* T + (x t - Da*)/3 



where ilyy and Il-p are respectively orthogonal projections on the sets VV and T>. 
8: end for 

9: return D (learned dictionary). 



At each iteration, the sparse code a*(x t ,D) is computed by solving the elastic-net formulation 
of [29]. We have chosen to use the LARS algorithm, a homotopy method [42J . which was originally 
developed to solve the Lasso formulation — that is, A2 = 0, but which can be modified to solve the 

5 As often done in machine learning, we use stochastic gradient descent in a setting where it is not guaranteed to 
converge in theory, but is has proven to behave well in practice, as shown in our experiments. The convergence proof 
of Bottou 1311 for non-convex problems indeed assumes three times differentiable cost functions. 
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elastic-net problem. Interestingly, it admits an efficient implementation that provides a Cholesky 
decomposition of the matrix (D^Da + A2l) _1 (see 02]) as well as the solution a*. In this 
setting, (3 can be obtained without having to solve from scratch a new linear system. 

The learning rate p t is chosen according to a heuristic rule. Several strategies have been presented 
in the literature (see [2"8l H5] and references therein) . A classical setting uses a learning rate of the 
form p/t, where p is a constant^) However, such a learning rate is known to decrease too quickly in 
many practical cases, and one sometimes prefers a learning rate of the form p/(t + to), which requires 
tuning two parameters. In this paper, we have chosen a learning rate of the form min(p, pto/t) — that 
is, a constant learning rate p during to iterations, and a 1/t annealing strategy when t > to , a strategy 
used by [43J for instance. Finding good parameters p and to also requires in practice a good heuristic. 
The one we have used successfully in all our experiments is to = T/10, where T is the total number 
of iterations. Then, we try several values of p during a few hundreds of iterations and keep the one 
that gives the lowest error on a small validation set. 

In practice, one can also improve the convergence speed of our algorithm with a mini-batch 
strategy — that is, by drawing rj > 1 samples at each iteration instead of a single one. This is a 
classical heuristic in stochastic gradient descent algorithms and, in our case, this is further motivated 
by the fact that solving rj elastic-net problems with the same dictionary D can be accelerated by the 
precomputation of the matrix D T D when r\ is large enough. Such a strategy is also used in |19| for 
the classical data-driven dictionary learning approach. In practice, the value r\ = 200 has given good 
results in all our experiments (a value found to be good for the unsupervised setting as well). 

As many algorithms tackling non-convex optimization problems, our method for learning su- 
pervised dictionaries can lead to poor results if is not well initialized. The classical unsupervised 
approach of dictionary learning presented in Eq. §3§ has been found empirically to be better behaved 
than the supervised one, and easy to initialize |19] . We therefore have chosen to initialize our dic- 
tionary D by addressing the unsupervised formulation of Eq. (|3]) using the SPAMS toolbox [19 
With this initial dictionary D in hand, we optimize with respect to W the cost function of Eq ([3]) , 
which is convex. This procedure gives us a pair (D,W) of parameters which are used to initialize 
Algorithm [1] 



4.3 Extensions 



We here present the slight modifications to Algorithm Q] necessary to address the two extensions 
discussed in Section I5T21 

The last step of Algorithm Q] updates the parameters D and W according to the gradients 
presented in Eq. (JTSJ). Modifying the algorithm to address the formulation of Section f3 . 2 . 1 1 also 
requires updating the parameters Z according to the gradient from Proposition [5J 



Z-p f (D/3*x T +v 2 Z) 



where II ^ denotes the orthogonal projection on the set Z. 

The extension to the semi-supervised formulation of Section 13.2.21 assumes that one can draw 
samples from the marginal distribution p(x). This is done in practice by cycling over a randomly 
permuted set of unlabeled vectors. Extending Algorithm Q] to this setting requires the following 
modifications: At every iteration, we draw one pair (y t , x f ) from p(y, x) and one sample xj fromp(x). 
We proceed exactly as in Algorithm [TJ except that we also compute a*' = a*(x' f ,D), and replace 
the update of the dictionary D by 



D <— II-p 



D - p t ( (1 - p) ( - B(3*a* T + ( Xt - Da*)/3* 



p> 



- (x' t - Da*')a 



(19) 



6 A 1 /t-asymptotic learning rate is usually used for proving the convergence of stochastic gradient descent algo- 
rithms |3 1 1 . 

Ihttp : //www . di ■ ens ■ f r/wili ow/SPAMS/ 
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where the term — (xj — Da*')a*' T is in fact the gradient Vd4(x(, D), as shown in [19J. 

5 Experimental Validation 

Before presenting our experiments, we briefly discuss the question of choosing the parameters of our 
approach. 

5.1 Choosing the Parameters 

Performing cross-validation on the parameters Ai, A2 (elastic-net parameters), v (regularization 
parameter) and p (size of the dictionary) would of course be cumbersome, and we use a few simple 
heuristics to either reduce the search space for these parameters or fix arbitrarily some of them. We 
have proceeded in the following way: 

• Since we want to exploit sparsity, we often set A2 to 0, even though A2 > is necessary in our 
analysis for proving the differentiability of our cost function. This has proven to give satisfactory 
results in most experiments, except for the experiment of Section 15.51 where choosing a small 
positive value for A2 was necessary for our algorithm to converge. 

• We have empirically observed that natural image patches (that are preprocessed to have zero- 
mean and unit ^-norm) are usually well reconstructed with values of Ai around 0.15 (a value 
used in |9J for instance), and that one only needs to test a few different values, for instance 
Ai = 0.15 + 0.025/c, with k e {-3, . . . , 3}. 

• When there is a lot of training data, which is often the case for natural image patches, the 
regularization with v becomes unnecessary and this parameter can arbitrarily set to a small 
value, e.g., v = 10~ 9 for normalized input data. When there are not many training points, this 
parameter is set up by cross-validation. 

• We have also observed that a larger dictionary usually means a better performance, but a higher 
computational cost. Setting the size of the dictionary is therefore often a trade-off between 
results quality and efficiency. In our experiments, we often try the values p in {50, 100, 200, 400}. 

We show in this section several applications of our method to real problems, starting with handwritten 
digits classification, then moving to the restoration of images damaged by an unknown nonlinear 
transformation, digital art authentification, and compressed sensing. 

5.2 Handwritten Digits Classification 

We choose in this section a discriminative task that consists of classifying digits from the MNIST [H] 
and USPS [453 handwritten datasets. MNIST is made of 70 000 28 x 28 images, 60 000 for training, 
10 000 for testing, each of them containing one handwritten digit. USPS is composed of 7291 training 
images and 2007 test images of size 16 x 16. 

We choose to address this multiclass classification problem with a one-vs-all strategy, learning 
independently 10 dictionaries and classifiers per class, using the formulation of Section 13.3.21 This 
approach has proven in our case to be more scalable than learning a single large dictionary with a 
multi-class loss, while providing very good results. In this experiment, the Lasso [30] is preferred 
to the elastic-net formulation |29| . and A2 is set to 0. All the digits are preprocessed to have zero- 
mean and are normalized to have unit ^2-norm. We try the parameters Ai = 0.15 + 0.025fc, with 
k G {—3, . . . , 3}, for the reasons mentioned in the previous section, and v in {10 _1 , . . . , 10 -6 }. For 
choosing the parameters, we use for MNIST the last 10 000 digits of the training set for validation, 
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Table 1: Test error in percent of our method for the digit recognition task for different dictionary 
sizes p. 



D 


unsupervised 


supervised 


P 


50 


100 


200 


300 


50 


100 


200 


300 


MNIST 


5.27 


3.92 


2.95 


2.36 


.96 


.73 


.57 


.54 


USPS 


8.02 


6.03 


5.13 


4.58 


3.64 


3.09 


2.88 


2.84 



and train on the first 50 000 ones. For USPS, we proceed in a similar way, keeping 10% of the training 
set for validation. Note that a full cross-validation scheme may give better results, but it would be 
computationally more expensive. 

Most effective digit recognition techniques in the literature use features with shift invariance 
properties |241 [415] . Since our formulation is less sophisticated than for instance the convolutional 
network architecture of [24J and does not enjoy such properties, we have chosen to artificially augment 
the size of our training set by considering versions of the digits that are shifted by one pixel in every 
direction. This is of course not an optimal way of introducing shift invariance in our framework, but 
it is fairly simple. 

After choosing the parameters using the validation set (and of course none of the testing set) , we 
retrain our model on the full training set. Each experiment is performed with 40 000 iterations of 
our algorithm with a mini-batch of size 200. To study the influence of the dictionary size, we report 
the performance on the test set achieved for different dictionary sizes, with p in {50, 100, 200, 300} 
for the two datasets. We observe that learning D in a supervised way significantly improves the 
performance of the classification. Moreover our method achieves state-of-the-art results on MNIST 
with a 0.54% error rate, which is similar to the 0.60% error rate of Our 2.84% error rate on 

USPS is slightly behind the 2.4% error rate of gBJ . 

Our second experiment follows |24) . where only a small percentage of the training samples are 
actually labelled. We use the semi-supervised formulation of Section 13.2.21 which exploits unlabeled 
data. Unlike the first experiment where the parameters are chosen using a validation set, and 
following [24], we make a few arbitrary choices. Indeed, we use p — 300, Ai = 0.075, A2 = and 
v = 10^5, which were the parameters chosen by the validation procedure in the previous experiment. 
The dictionaries associated with each digit class are initialized using the unsupervised formulation 
of Section [21 To measure the performance of our algorithm for different values of p, we use a 
continuation strategy: Starting with p = 1.0, we sequentially decrease its value by 0.1 until we have 
fi = 0, learning with 10 000 iterations for each new value of \x. We report the corresponding error 
rates in Figure[TJ showing that our approach offers a competitive performance similar to [21]. Indeed, 
the best error rates of our method for n = 300, 1000, 5000 labeled data are respectively 5.81, 3.55 and 
1.81%, which is similar to [21] who has reported 7.18, 3.21 and 1.52% with the same sets of labeled 
data. 

5.3 Learning a Nonlinear Image Mapping 

To illustrate our method in the context of regression, we consider a classical image processing task 
called "inverse halftoning". With the development of several binary display technologies in the 70s 
(including, for example, printers and PC screens), the problem of converting a grayscale continuous- 
tone image into a binary one that looks perceptually similar to the original one ("halftoning") was 
posed to the image processing community. Examples of halftoned images obtained with the classical 

8 It is also shown in 1241 that better results can be achieved by considering deformations of the training set. 
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Digit recognition with unlabeled data 



-e- 



-*-n=300 
-e-n=1000 

-0-n=5OOO 



-H- 

-e- 



1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 



Figure 1: Error rates on MNIST when using n labeled data, for various values of \i. 



Floyd-Steinberg algorithm [47] are presented in the second column of Figure [3 with original images 
in the first column. Restoring these binary images to continuous-tone ones ("inverse halftoning") has 
become a classical problem (see [?S] and references therein). 

Unlike most image processing approaches that address this second problem by explicitly modeling 
the halftoning process, we formulate it as a signal regression problem, without exploiting any prior 
on the task. We use a database of 36 images; 24 are high-quality images from the Kodak PhotoCD 
datase10 and are used for training, and 12 are classical images often used for evaluating image 
processing algorithms F°l the first four (house, peppers, cameraman, lena) are used for validation and 
the remaining eight for testing. 

We apply the Floyd-Steinberg algorithm implemented in the LASIP Matlab toolbosB to the 
grayscale continuous-tone images in order to build our training/validation/testing set. We extract 
all pairs of patches from the original/halftoned images in the training set, which provides us with a 
database of approximately 9 millions of patches. We then use the "signal regression" formulation of 
Eq. p2p to learn a dictionary D and model parameters W, by performing two passes of our algorithm 
over the 9 million training pairs. 

At this point, we have learned how to restore a small patch from an image, but not yet how to 
restore a full image. Following other patch-based approaches to image restoration [2], we extract 
from a test image all patches including overlaps, and restore each patch independently so that we get 
different estimates for each pixel (one estimate for each patch the pixel belongs to) . The estimates are 
then averaged to reconstruct the full image. Estimating each patch independently and then averaging 
the estimates is of course not optimal, but it gives very good results in many image restoration 
tasks (see, e.g., [21 HI)- The final image is then post-processed using the denoising algorithm of [3] to 
remove possible artefacts. 

To evaluate our method quantitatively, we measure how well it reconstructs the continuous-tone 
images of the test set from the halftoned ones. To reduce a bit the number of hyperparameters of our 
model, we have made a few arbitrary choices: We also use the Lasso formulation for encoding the 
signals — that is, we set A2 = 0. With millions of training samples, our model is unlikely to overfit and 
the regularization parameter v is set to as well. The remaining free parameters of our approach arc 
the size m of the patches, the size p of the dictionary and the regularization parameter Ai. We have 



'http : //rOk. us/graphics/kodak/ 

10 The list of these images can be found in [4], where they are used for the problem of image denoising. 
11 http : //www . cs . tut . f i/~lasip/ 
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Table 2: Inverse halftoning experiments. Results are reported in PSNR (higher is better). SA-DCT 
refers to 03], LPA-ICI to gH], FIHT2 to [SIJ| and WInHD to [SI]. Best results for each image are in 
bold. 





Validation set 


Test set 


Image 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


FIHT2 


30.8 


25.3 


25.8 


31.4 


24.5 


28.6 


29.5 


28.2 


29.3 


26.0 


25.2 


24.7 


WInHD 


31.2 


26.9 


26.8 


31.9 


25.7 


29.2 


29.4 


28.7 


29.4 


28.1 


25.6 


26.4 


LPA-ICI 


31.4 


27.7 


26.5 


32.5 


25.6 


29.7 


30.0 


29.2 


30.1 


28.3 


26.0 


27.2 


SA-DCT 


32.4 


28.6 


27.8 


33.0 


27.0 


30.1 


30.2 


29.8 


30.3 


28.5 


26.2 


27.6 


Ours 


33.0 


29.6 


28.1 


33.0 


26.6 


30.2 


30.5 


29.9 


30.4 


29.0 


26.2 


28.0 



optimized these parameters by measuring the mean-squared error reconstruction on the validation 
set. We have tried patches of size m — Ixl, with I € {6, 8, 10, 12, 14, 16}, dictionaries of sizes p = 100, 
250 and 500 , and determined Ai by first trying values on the logarithmic scale 10 l , i = —3, 2, then 
refining this parameter on the scale 0.1,0.2,0.3,..., 1.0. The best parameters found are m = 10 x 10, 
p = 500 and Ai = 0.6. Since the test procedure is slightly different from the training one (the test 
includes an averaging step to restore a full image whereas the training one does not), we have refined 
the value of Ai, trying different values one an additive scale {0.4, 0.45, 0.75, 0.8}, and selected 
the value Ai = 0.55, which has given the best result on the validation set. 

Note that the largest dictionary has been chosen, and better results could potentially be obtained 
using an even larger dictionary, but this would imply a higher computational cost. Examples of 
results are presented in Figure [5] Halftoned images are binary but look perceptually similar to the 
original image. Reconstructed images have very few artefacts and most details are well preserved. 
We report in Table [5] a quantitative comparison between our approach and various ones from the 
literature, including the state-of-the-art algorithm of |48| , which had until now the best results on 
this dataset. Even though our method does not explicitly model the transformation, it leads to 
better results in terms of PSNR[3 We also present in Figure [3] the results obtained by applying our 
algorithm to various binary images found on the web, from which we do not know the ground truth, 
and which have not necessarily been obtained with the Floyd-Steinberg algorithm. The results are 
qualitatively rather good. 

From this experiment, we conclude that our method is well suited to (at least, some) nonlinear 
regression problems on natural images, paving the way to new applications of sparse image coding. 



5.4 Digital Art Authentification 

Recognizing authentic paintings from imitations using statistical techniques has been the topic of 
a few recent works |52[ IS5I 154] . Classical methods compare, for example, the kurtosis of wavelet 
coefficients between a set of authentic paintings and imitations [52], or involve more sophisticated 
features [S3]. Recently, Hugues et al. [SI] have considered a dataset of 8 authentic paintings from 
Pieter Bruegel the Elder, and 5 imitations^ They have proposed to learn dictionaries for sparse 

12 Denoting by MSE the mean-squared-error for images whose intensities are between and 255, the PSNR is defined 
as PSNR = 101og 10 (255 2 /MSE) and is measured in dB. A gain of ldB reduces the MSE by approximately 20%. 
13 The origin of these paintings is assessed by art historians. 
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Figure 2: From left to right: Original images, halftoned images, reconstructed images. Even though 
the halftoned images (center column) perceptually look relatively close to the original images (left col- 
umn) , they are binary. Reconstructed images (right column) are obtained by restoring the halftoned 
binary images. Best viewed by zooming on a computer screen. 
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Figure 3: Results on various binary images publicly available on the Internet. No ground truth is 
available for these images from old computer games, and the algorithm that has generated these 
images is unknown. Input images are on the left. Restored images are on the right. Best viewed by 
zooming on a computer screen. 
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coding, and use the kurtosis of th e sp arse coefficients as discriminative features. We use their dataset, 
which they kindly provided to usl 14 l 

The supervised dictionary learning approach we have presented is designed for classifying rela- 
tively small signals, and should not be directly applicable to the classification of large images, for 
which classical computer vision approaches based on bags of words may be better adapted (see |121 155] 
for such approaches). However, we show that, for this particular dataset, a simple voting scheme 
based on the classification of small image patches with our method leads to good results. 

The experiment we carry out consists of finding which painting is authentic, and which one is 
fake, in a pair known to contain one of eachj^f] We proceed in a leave-one-out fashion, where we 
remove for testing one authentic and one imitation paintings from the dataset, and learn on the 
remaining ones. Since the dataset is small and does not have an official training/test set, we measure 
a cross-validation score, testing all possible pairs. We consider 12 x 12 color image patches, without 
any pre-processing, and classify each patch from the test images independently. Then, we use a 
simple majority vote among the test patches to decide which image is the authentic one in the pair 
test, and which one is the imitation! 16 ! 

For each pair of authentic/imitation paintings, we build a dataset containing 200 000 patches 
from the authentic images and 200 000 from the imitations. We use the formulation from Eq. (TT5)) 
for binary classification, and arbitrarily choose dictionaries containing p = 100 dictionary elements. 
Since the training set is large, we set the parameter v to 0, also choose the Lasso formulation for 
decomposing the patches by setting A2 = 0, and cross- validate on the parameter Ai, trying values on 
a grid {10 -4 , 10 -3 , . . . , 10°}, and then refining the result on a grid with a logarithmic scale of 2. We 
also compare Eq. (fT5)l with the logistic regression loss and the basic formulation of Eq. ([5]) where D 
is learned unsupervised. 

For classifying individual patches, the cross-validation score of the supervised formulation is 
a classification rate of 54.04 ± 2.26%, which slightly improves upon the "unsupervised" one that 
achieves 51.94 ± 1.92%. The task of classifying independently small image patches is difficult since 
there is significant overlap between the two classes. On the other hand, finding the imitation in 
a pair of (authentic, imitation) paintings with the voting scheme is easier and the "unsupervised 
formulation" only fails for one pair, whereas the supervised one has always given the right answer in 
our experiments. 



5.5 Compressed Sensing 

In this experiment, we apply our method to the problem of learning dictionaries and projection 
matrices for compressed sensing. As explained in Section l"3. 3. 41 our formulation and the conclusions 
of this section hold for relatively small signals, where the sensing matrix can be stored into memory 
and learned. Thus, we consider here small image patches of natural images of size m = 10 x 10 pixels. 
To build our training/validation/test set, we have chosen the Pascal VOC 2006 database of natural 
images [56]: Images 1 to 3000 are used for training; images 3001 to 4000 are used for validation, and 
the remaining 1304 images are kept for testing. Images are downsampled by a factor 2 so that the 
JPEG compression artefacts present in this dataset become visually imperceptible, thereby improving 
its quality for our experiment. 

We compare different settings where the task is to reconstruct the patches y of size m = 10 x 10, 
from an observation Zx of size r <C m (for instance r = 20 linear measurements), where Z in M. rxm 
is a sensing matrix. For simplicity reasons, we only consider here the noiseless case, where y = x. 

14 It would have been interesting to use the datasets used in 1521 1531 . but they are not publicly available. 

15 This task is of course considerably easier than classifying each painting as authentic or fake. We do not claim to 
propose a method that readily applies to forgery detection. 

16 Note that this experimental setting is different from 1541 . where only authentic paintings are used for training (and 
not imitations). We therefore do not make quantitive comparison with this work. 
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Table 3: Compressed sensing experiment on small natural image patches. The mean squared error 
(MSE) measured on a test set is reported for each scenario with standard deviations, obtained by 
reproducing 5 times each experiment, randomizing the algorithm initializations and the sampling 
of the training images. Each patch is normalized to have unit £2 norm, and the mean squared 
reconstruction error is multiplied by 100 for readability purposes. Here, r is the number of rows of 
the matrix Z. The different scenarios vary with the way D and Z are learned (fixed, unsupervised, 
supervised). See the main text for details. 



z 


RANDOM 


SL1 


PCA 


SL2 


D 


DCT 


UL 


SL 


SL 


DCT 


UL 


SL 


SL 


r = 5 


77.3±4.0 


76.9±4.0 


76.7±4.0 


54.1±1.3 


49.9±0.0 


47.6±0.0 


47.5±0.1 


47.3±0.3 


r = 10 


57.8±1.5 


56.5±1.5 


55.7±1.4 


36.5±0.7 


33.7±0.0 


32.3±0.0 


32.3±0.1 


31.9±0.2 


r = 20 


37.1±1.2 


35.4±1.0 


34.5±0.9 


21.4±0.1 


20.4±0.0 


19.7±0.0 


19.6±0.1 


19.4±0.2 


r = 40 


19.3±0.8 


18.5±0.7 


18.0±0.6 


10.0±0.3 


9.2 ±0.0 


9.1 ±0.0 


9.0 ±0.0 


9.0 ±0.0 



At test time, as explained in Section (|3.3.4[) . our estimate of y is Wa*(Zx,D), where D in M rxp 
is a dictionary for representing Zx, and W in R mx P can be interpreted here as a dictionary for 
representing y. We evaluate several strategies for obtaining (Z,D,W): 

• We consider the case, which we call RANDOM, where the entries of Z are i.i.d. samples of the 
Gaussian distribution A/"(0, 1/^/rn). Since the purpose of Z is to reduce the dimensionality of 
the input data, it is also natural to consider the case where Z is obtained by principal component 
analysis on natural image patches (PCA). Finally, we also learn Z with the supervised learning 
formulation of Eq. (fl"5)) . (SL), but consider the case where it is initialized randomly (SL1) or 
by PCA (SL2). 

• The matrix D can either be fixed or learned. A typical setting would be to set D = ZD', 
where D' is discrete-cosine-transform matrix (DCT), often used in signal processing applica- 
tions [2J. It can also be learned with an unsupervised learning formulation (UL), or a supervised 
one (SL). 

• W is always learned in a supervised way. 

Since our training set is very large (several millions of patches) , we arbitrarily set the regularization 
parameters v\ and v-i to 0. We measure reconstruction errors with dictionaries of various levels of 
overcompleteness by choosing a size p in {100,200,400}. The remaining free parameters are the 
regularization parameters Ai and A2 for obtaining the coefficients a*. We try the values Ai = 10*, 
with i in {—5, . . . , 0}. Unlike what we have done in the experiments of Section 15.31 it is absolutely 
necessary in this setting to use A2 > (according to the theory), since using a zero value for this 
parameter has led to instabilities and prevented our algorithm from converging. We have tried the 
values A2 = 10 l Ai, with i in {—2,-1,0}. Each learning procedure is performed by our algorithm 
in one pass on 10 millions of patches randomly extracted from our training images. The pair of 
parameters (Ai,A2) that gives the lowest reconstruction error on the validation set is selected, and 
we report in Table [3] the result obtained on a test set of 500 000 patches randomly extracted from the 
1304 test images. The conclusions of this compressed sensing experiment on natural image patches 
are the following: 

• When Z is initialized as a Gaussian random matrix (case RANDOM), learning D and Z signif- 
icantly improves the reconstruction error (case SL1). A similar observation was made in [16j. 
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• Results obtained with PCA are in general much better than those obtained with random pro- 
jections, which is consistent with the conclusions of [38 . 

• However, PCA does better than SL1. When PCA is used for initializing our supervised for- 
mulation, results can be slightly improved (case SL2). This illustrates either the limits of the 
non-convex optimization procedure, or that PCA is particularly well adapted to this problem. 

• Learned dictionaries (cases UL and SL) outperform classical DCT dictionaries. 

6 Conclusion 

We have presented in this paper a general formulation for learning sparse data representations tuned 
to specific tasks. Unlike classical approaches that learn a dictionary adapted to the reconstruction of 
the input data, our method learns features in a supervised way. We have shown that this approach 
is effective for solving classification and regression tasks in a large-scale setting, allowing the use of 
millions of training samples, and is able of exploiting successfully unlabeled data, when only a few 
labeled samples are available. Experiments on handwritten digits classification, non-linear inverse 
image mapping, digital art authentification, and compressed sensing have shown that our method 
leads to state-of-the-art results for several real problems. Future work will include adapting our 
method to various image processing problems such as image deblurring and image super-resolution, 
and other inverse problems. 
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A Proofs and Lemmas 

Before giving the proof of Proposition [TJ we present two general results on the elastic net formulation 

of [25]. 

Lemma 1 (Optimality conditions of the elastic net) 

The vector a* is a solution of Eq. ^ if and only if for all j in {1, . . . ,p}, 

dj(x - Da*) - A 2 a*L?'] = Xi sign(a*[j]) if at* [7] ^ 0, 

T ( 20 ) 

|dj (x — Da*) — A2a*[j]| < Ai otherwise. 
Denoting by A = {j 6 {1, . . . ,p} s.t. at*\j] ^ 0} the active set, we also have 

a* 1 = (DXD A + A 2 I)- 1 (DXx-A lSA ), (21) 
where sa in { — 1;+1}' A ' carries the signs of a\. 

Proof. Equation (|20p can be obtained by considering subgradient optimality conditions as done in 
[57] for the case A2 = 0. These can be written as 

e {-D T (x - Da*) + A 2 a* + A lP :pe <9||a*||i}, 
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where 9||a*||i denotes the subdifferential of the l\ norm evaluated at a*. A classical result (see |58) . 
page 238) is that the subgradients p of this subdifferential are characterized by the fact that for all j 
in {1, . . . ,p}, p[j] = sign(a*[j]) if ot*\j] =^ 0, and \p[j]\ < 1 otherwise. This gives directly Eq. ([2"0)l . 
The equalities in Eq. ([2U)l define a linear system whose solution is Eq. (|2"Tj) . ■ 



The next proposition exploits these optimality conditions to characterize the regularity of a* . 

Proposition 3 (Regularity of the elastic net solution) 

Assume A2 > and (A.). Then, 

1. The function a* is uniformly Lipschitz on X x T> . 

2. Let D be in T>, s be a positive scalar and s be a vector in { — 1, 0, and define K S (D, s) C X 
as the set of vectors x satisfying for all j in {1, ... ,p}, 

|dT(x-Da*)-A 2 cr*[7]|<Ai-E if s\j] = 0, 
s\j]ac*\j]>e if B \j]^0. yLi) 

where a* is shorthand for a*(x, D). 

Then, there exists k > independent of s , D and e so that for aHx in K s (T),e), the function a* 
is twice continuously differ entiable on B Ke (x) x B Ke (D), where B Ke (x) and B K£ (D) denote the 
open balls of radius ne respectively centered on x and D. 

Proof. The first point is proven in [TIJ]. The proof uses the strong convexity induced by the 
elastic-net term, when A2 > 0, and the compactness of X from Assumption (A). 

For the second point, we study the differentiability of a* on sets that satisfy conditions which are 
more restrictive than the optimality conditions of Eq. (|20p. Concretely, let D be in D, e > and s 
be in { — 1, 0, +1} P . The set K S (D, e) characterizes the vectors x so that a*(x, D) has the same signs 
as s (and same set of zero coefficients), and a*(x, D) satisfies the conditions of Eq. ([2"U|) . but with 
two additional constraints: (i) The magnitude of the non-zero coefficients in a* should be greater 
than e. (ii) The inequalities in Eq. (|20p should be strict with a margin s. The reason for imposing 
these assumptions is to restrict ourselves to points x in X that have a stable active set — that is, the 
set of non-zero coefficients A of a* should not change for small perturbations of (x, D), when x is in 
K s (D,e). 

Proving that there exists a constant n > satisfying the second point is then easy (if a bit 
technical): Let us suppose that K S (D, e) is not empty (the case when it is empty is trivial). Since a* 
is uniformly Lipschitz with respect to (x, D), so are the quantities dj(x — Da*) — A2«*[j] and 
s[j]a*[j], for all j in {1, . . . ,p}. Thus, there exists n > independent of x and D such that for all 
(x', D') satisfying ||x — x'||2 < ks and ||D — D'||i? < ne, we have for all j in {1, . . . ,p}, 

|dT'( x '-D'a*')-A 2 a*'b]| < Ai - f if s\j] = 0, 
s[j]a*'[j]>! if s[j]^0. 

where a.*' is short-hand for a*(x',D'), and x' is therefore in K S (D' , e/2). It is then easy to show 
that the active set A of a* and the signs of a* are stable on £? Ke (x) x B K£ (D), and that a\ is given 
by the closed form of Eq. (|2"Tj) . a* is therefore twice differentiable on £? K£ (x) x B K£ (D). ■ 

With this proposition in hand, we can now present the proof of Proposition [T] 

Proof. The differentiability of / with respect to W is easy using only the compactness of y and X 
and the fact that i s is twice differentiable. We will therefore focus on showing that / is differentiable 
with respect to D, which is more difficult since a* is not differentiable everywhere. 
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Given a small perturbation E in M. mxp of D, we compute 
/(D + E,W)-/(D,W) = 



Ey,x 



V a ^(a*(x,D + E)-a*(x,D)) + 0(||E|||), (23) 



where V a 4 is short for 'V a £ s (y, W, a*), and the term 0(||E|| F ) comes from the fact that a* is 
uniformly Lipschitz and X x V is compact. 

Let now choose W in W and D in T>. We have characterized in Lemma [3] the differentiability 
of a* on some subsets of X x P. We consider the set 

tf(D,e)4 (J Jf.(D,e), 
se{-i,o,i}p 

and denoting by P our probability measure, it is easy to show with a few calculations that ¥(X \ 
K(D,e)) = O(e). Using the constant n defined in Lemma[31 we obtain that F(X\K(D, ||E||f/«0) = 
0(||E|| F ). Since V a 4(y, W, a*) T (a*(x, D + E) - a*(x, D)) = 0(||E|| F ), the set X\K(D, ||E|| f /k) 
can be neglected (in the formal sense) when integrating with respect to x in the expectation of 
Eq. (|2"3")l , and it is possible to show that 

/(D + E, W) - /(D, W) = Tr (E T 5 (D, W)) + 0(||E||| ), 

where g has the form given by Eq. (|16[) . This shows that / is differentiable with respect to D, and 
its gradient Vd/ is g. I 
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